Reproducing Nutriscore: A Machine Learning approach
Task description
We’d like to see if we can predict Nutriscore from foods using the OpenFoodFacts API. With a suite of clustering and classification models, we’d like to work backwards to see if we can work out what determines Nutriscore, then look at the actual calculation to see how close we got. While this exercise seems pointless in itself, the same technique could be used to work backwards to calculate closed source indices, such as those used by financial industries.
We’ll start with a naïve approach, applying an array of unsupervised clustering techniques to the nutritional factors to see if they form natural groups that match Nutriscore. 1
1 If these groups exist, I suspect this is how the index was created in the first place.
Afterwards, we’ll use supervised classification techniques to create Nutriscore models. We’ll cross-validate them and check them against a validation set.
Data description
Our initial plan was to use the Python openfoodfacts library to pull in data directly from the API. Unfortunately, the API was unreliable and had frequent periods of downtime, so we opted to use a pre-downloaded dump of the OpenFoodFacts data from Kaggle. The Kaggle data comes as a zipped tab-separated file (.tsv). When extracted, it is about 1GB in size and contains 356 027 rows and 163 columns.
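A file of this size does not need to be loaded whole. The sketch below streams a tab-separated file row by row and keeps only the requested columns. It is a minimal illustration rather than the code used for this report: the function name read_tsv_rows is hypothetical, the column names are assumed to match the dump, and the sample rows are made up.

```python
import csv
import io


def read_tsv_rows(handle, columns):
    """Yield one dict per row, keeping only the requested columns."""
    # QUOTE_NONE treats quote characters as ordinary text, which is safer
    # for free-text fields that may contain unbalanced quotes.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # Missing columns or short rows become empty strings.
        yield {name: row.get(name) or "" for name in columns}


# Tiny in-memory stand-in for the real file (values are invented).
sample = io.StringIO(
    "product_name\tsugars_100g\tnutrition_grade_fr\n"
    "Example cereal\t12.5\tb\n"
    "Example iced tea\t4.2\tc\n"
)
rows = list(read_tsv_rows(sample, ["product_name", "nutrition_grade_fr"]))
```

For the real data, the same function can be given a file opened with open(path, encoding="utf-8", newline=""), where path points at the extracted .tsv file.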
Processing
In order to start using this data, we first had to filter and process it.
Nutriscore is one of several dimensions available on OpenFoodFacts. Others include nutritional data per 100g 2:
2 We chose these nutrients as they’re the ones most commonly reported. Beyond these, coverage drops from roughly 90% of foods having a value to roughly 90% having none.
- Energy (kJ)
- Protein
- Sugar
- Sodium
- Salt
- Saturated fat
- Fat
- Carbohydrates
It also contains tags for all foods, making it easy to select food types. Analysing the Nutriscore values for everything would be tedious and slow, so we’re going to look at a few product types:
- Breakfast cereals
- Iced teas3
- Biscuits and cakes
3 Iced tea varies a lot in sugar content as it’s often geared to the health market and so has options with either low sugar, or where some of the sugar has been replaced with artificial sweeteners.
All of these foods are highly variable in sugar and fat content, so they should give us an idea of what causes Nutriscore to change.
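Selecting the three food types by tag might look like the sketch below. The column name categories_tags, the nutrient column names and the three tag strings are assumptions about the dump, not values confirmed in this report, and food_category is a hypothetical helper.

```python
# Assumed nutrient column names (values per 100g) in the dump.
NUTRIENT_COLUMNS = [
    "energy_100g", "proteins_100g", "sugars_100g", "sodium_100g",
    "salt_100g", "saturated-fat_100g", "fat_100g", "carbohydrates_100g",
]

# Assumed tag strings for the three food types studied here.
CATEGORY_TAGS = {
    "en:breakfast-cereals": "Breakfast cereals",
    "en:iced-teas": "Iced teas",
    "en:biscuits-and-cakes": "Biscuits and cakes",
}


def food_category(row):
    """Return the label of the first tag on the row that we study, or None."""
    tags = (row.get("categories_tags") or "").split(",")
    for tag in tags:
        label = CATEGORY_TAGS.get(tag.strip())
        if label is not None:
            return label
    return None


example = {"categories_tags": "en:beverages,en:iced-teas"}
category = food_category(example)
```

Rows for which food_category returns None fall outside the three food types and can be skipped.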
Filtering
Even after reducing our dataset, we still noticed some unusual values. In particular, some salt contents were implausibly high.4 We chose to remove all rows with more than 2% salt (2g per 100g), along with rows exceeding the corresponding threshold for sodium.5
4 Hello Panda may not be the healthiest treat, but it’s unlikely to contain more than its own weight in salt.
5 Table salt is ~40% sodium, so our sodium threshold was \(2\times0.4=0.8\)g per 100g.
Given that the data we had retained was relatively complete, we opted to discard any rows with missing values rather than imputing. This caused a loss of ~15% of our data.
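The two filtering rules described above (drop implausible salt or sodium values, then drop rows with missing values) can be sketched as follows. The column names are assumed, and to_float and clean_row are hypothetical helpers written for illustration.

```python
import math

SALT_MAX = 2.0               # g per 100g, i.e. 2% salt
SODIUM_MAX = SALT_MAX * 0.4  # table salt is ~40% sodium, so 0.8 g per 100g


def to_float(text):
    """Parse a number, returning None for empty, invalid or NaN values."""
    try:
        value = float(text)
    except (TypeError, ValueError):
        return None
    return None if math.isnan(value) else value


def clean_row(row, columns):
    """Return the row's nutrients as floats, or None if it should be dropped."""
    values = {name: to_float(row.get(name)) for name in columns}
    if any(value is None for value in values.values()):
        return None  # missing value: discard rather than impute
    if values["salt_100g"] > SALT_MAX or values["sodium_100g"] > SODIUM_MAX:
        return None  # implausibly salty
    return values


columns = ["salt_100g", "sodium_100g", "sugars_100g"]
kept = clean_row({"salt_100g": "1.0", "sodium_100g": "0.4", "sugars_100g": "30"}, columns)
dropped = clean_row({"salt_100g": "120", "sodium_100g": "48", "sugars_100g": "30"}, columns)
```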
Exploration
Nutrients
The first thing to look at when you have a new dataset is distribution. Let’s look at the overall distribution of each of our predictor variables.
A number of these have large outliers, but now that the extreme sodium and salt values have been removed in the processing step, none of them seem too terrible.6
6 Turns out the reason there’s a huge outlier here is the presence of some dried coconut that counts as a biscuit. We didn’t remove it as there are a number of biscuits from Brittany not too far away with astonishing levels of saturated fats.
The next stage of describing our nutrients is to split them by type of food.
There are some very clear patterns in the nutrient distributions by food category.7 Iced tea behaves quite differently to the others as it has virtually no nutrients other than sugar and energy. Biscuits seem to have the most normally distributed nutrients, while cereals include a number of healthy products with low sodium and fat, alongside a wide spread of products with more of those nutrients.
7 Please note that the x-axes are free. I wasn’t able to fix them per column and it would have been pointless to fix them for the whole grid.
Nutriscores
Now that we’ve looked at the nutrients, we’ll turn our attention to the Nutriscores.
Our data is mostly unhealthy food, it would seem. We have a lot of foods with scores of D and E, but not so many with scores of A and B. Given that most of the data in OpenFoodFacts is processed food, this isn’t surprising - but our choices of category are certainly not helping our cause.
Splitting up the Nutriscores by category, we can see that iced teas are generally unhealthy, with not a single one getting a score of A. Biscuits and cakes show the distribution we would expect, as do breakfast cereals, where there is a gap between the likes of All-Bran and Coco Pops.8
8 I suspect that the reason there are so many Cs is that makers of unhealthy cereals add fibre to their products to reach the first “healthy-ish” tier.
Unsupervised clustering
The first question we wanted to answer was “Do Nutriscores form natural groups?”. While there is obviously an underlying model linking foods to Nutriscores, the question could be otherwise formulated: “If a machine creates clusters without knowledge of Nutriscores, will the clusters resemble Nutriscores?”
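One way to make “resemble” concrete is to score the agreement between cluster labels and Nutriscore grades with the adjusted Rand index, which is 1 for identical groupings and close to 0 for agreement no better than chance. This metric is my suggestion rather than something taken from the analysis, and the labels below are invented. The sketch uses only the standard library; scikit-learn offers the same measure as adjusted_rand_score.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Chance-corrected agreement between two groupings of the same items."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same items")
    n = len(labels_a)
    if n < 2:
        return 1.0  # nothing to disagree about
    # Count pairs of items that share a group in both, in a only, in b only.
    both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    in_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    in_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = in_a * in_b / comb(n, 2)  # agreement expected by chance
    maximum = (in_a + in_b) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. a single group in both labelings
    return (both - expected) / (maximum - expected)


# Invented example: clusters that match the grades exactly, up to renaming.
grades = ["a", "a", "b", "b", "c", "c"]
clusters = [2, 2, 0, 0, 1, 1]
score = adjusted_rand_index(grades, clusters)
```

Because the index ignores the names of the groups, it does not matter that a clustering algorithm numbers its clusters arbitrarily.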
Classification
Conclusions
Future improvements
- Find more foods with Nutriscore A and B
- Perform a feature importance analysis to find more easily which variables matter most
- Do an analysis which takes into account the ordinal nature of Nutriscores
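As an illustration of the last point, an ordinal-aware evaluation would penalise predicting E for an A food more heavily than predicting B. A minimal sketch, assuming grades are stored as the letters a to e and using a hypothetical helper mean_grade_distance with invented predictions:

```python
# Map each grade to its position on the A-E scale.
GRADE_RANK = {"a": 0, "b": 1, "c": 2, "d": 3, "e": 4}


def mean_grade_distance(true_grades, predicted_grades):
    """Average number of grade steps between truth and prediction."""
    if len(true_grades) != len(predicted_grades) or not true_grades:
        raise ValueError("need two non-empty sequences of equal length")
    distances = [
        abs(GRADE_RANK[t.lower()] - GRADE_RANK[p.lower()])
        for t, p in zip(true_grades, predicted_grades)
    ]
    return sum(distances) / len(distances)


# Invented example: one near miss, one exact hit, one far miss.
error = mean_grade_distance(["a", "c", "e"], ["b", "c", "a"])
```

Plain accuracy would treat the near miss and the far miss in this example identically, whereas this measure counts them as one and four steps respectively.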